Abstract

Widespread interest in the diffusion of information through social networks has produced a large number of Social Dynamics models. A majority of them use theoretical hypotheses to explain their diffusion mechanisms, while the few empirically based ones average out their measures over many messages of different content. Our empirical research, which tracks the step-by-step email propagation of an invariable viral marketing message, delves into the content impact and has discovered new and striking features. The topology and dynamics of the propagation cascades display patterns not inherited from the email networks carrying the message. Their disconnected, low-transitivity, tree-like cascades present a positive correlation between their nodes' probability to forward the message and the average number of neighbors they target, and show increased participants' involvement as the propagation path length grows. Such patterns, not described before nor replicated by any of the existing models of information diffusion, can be explained if participants make their pass-along decisions based uniquely on local knowledge of their network neighbors' affinity with the message content. We prove the plausibility of such a mechanism through a stylized, agent-based model that replicates the Affinity Paths observed in real information diffusion cascades.
1. Introduction and Background

The discovery of quantitative laws in the collective properties of large numbers of people, for example the birth and death rates or crime frequencies, was one of the factors pushing the development of statistics and led scientists and philosophers to call for some quantitative understanding of how such precise regularities stem from the apparently erratic behavior of individuals. Hobbes, Laplace, Comte, Stuart Mill and many others shared, to a different extent, this line of thought (Ball, 2004). The question to investigate was how the interactions between social agents create order in their behavior from an initially disordered state. The basic premise was that agents' repeated interactions should make people more similar, since the information exchanges involved led to higher degrees of homogeneity in values, thoughts or preferences. The dynamic nature of information diffusion, the poor understanding of the causes of human behavior and the fact that the agents' interactions take place in the thick of complex social networks made the Social Dynamics problem largely intractable for a long time.

The appearance of new social phenomena related to the In- 
ternet (Social Media, Collaborative Filtering, Social Tagging...) 
whose interactions can be captured in large databases and the 
tendency of social scientists to move toward the formulation 
of simplified models and their quantitative analysis, have ush- 
ered in an era of scientific research in the field of Social Dy- 
namics (Lazer et al., 2009). Several key questions have been 
posed: What favors the homogenization process? What hinders 
it? What are the fundamental interaction mechanisms fostering 
the adoption of innovations, the spreading of rumors, the evo- 
lution towards a dominant opinion or the emergence of trends 
and fashions? 

Initially, the difficulty in obtaining micro-level data on the diffusion of information between individuals, the absence of suitable mathematical algorithms to rigorously analyze the phenomena and the computational complexity involved in simulations with large real networks limited theoretical advancements to the construction of population-average diffusion models based on master or differential equations. Those models were in general borrowed from mathematical epidemiology (Hethcote, 2000), since it was assumed that information would propagate just like diseases do. However, information diffusion research has evolved considerably since step-by-step tracking of interactions through electronic media made detailed diffusion data plentiful (although not necessarily accessible or easy to gather).

The development of the science of complex systems and advancements in the computerized treatment of Social Network Analysis methods have spurred the emergence of a "new" science of networks (Watts, 2004), which provides more robust tools for the scientific treatment of social dynamics processes. As a result, scientists realized that information spreading mechanisms vary with the type of information, which spawned a rush to develop the appropriate model for each. According to their algorithmic approach, those models can be categorized as population-average or network-based. The population-average models assume fully-mixed or homogeneous substrate networks and describe the agents' social dynamic behavior at the aggregate level through differential or master equations. Examples are the seminal "two-step influence model" of information diffusion by Katz and Lazarsfeld (1955), the rumor diffusion model of Daley and Kendall (1965), the innovations adoption model of Bass (1969), its stochastic version by Niu (2002), the minority spreading opinion formation model of Galam (2002), the innovation diffusion model with influentials and imitators of Van den Bulte and Joshi (2007) and the percolation-based product lifecycle model of Frenken et al. (2008). On the other hand, network-based models include the influence of the underlying social network topology by way of agent-based stochastic algorithms. Some examples are the classic innovation adoption "threshold model" of Granovetter (1978), the model of diffusion of technological innovations with upgrading costs of Guardiola et al. (2002), the fads and fashion formation model of Bettencourt (2002), models of the impact of the structural characteristics of a network on innovations diffusion (Jackson and Yariv, 2005; Liu et al., 2005), the stochastic model for opinion formation of Sznajd-Weron (2005) and the network variant of the Daley-Kendall rumor model by Nekovee et al. (2007).

However, this profusion of theoretical models was mainly justified by plausibility arguments, and Social Dynamics models based on empirical data are still scarce. A few examples are the referral networks study of Vilpponen et al. (2006), which found that the structure of electronic communication networks is different from that of traditional interpersonal communication ones; the chain-letter diffusion research of Liben-Nowell and Kleinberg (2008), whose strikingly long and narrow spreading chains were attributed to a new mechanism involving asynchronous response times of the forwarders; and the study of information diffusion through blogs of Gomez-Rodriguez et al. (2010), which found a core-periphery structure in the blogosphere news diffusion network. Nevertheless, all these studies could only trace the propagation of messages with varying content and are unable to discriminate the propagation of individual content items. As a result, none of them could study the impact of the information content on the diffusion processes. While the lack of insight into the content impact would be expected of past century information diffusion research, its absence in more recent literature can only be explained because propagation data at the individual level, being usually proprietary because of its economic value or usage restrictions, is kept under tight wraps and is very hard to obtain.

Our research addresses this shortcoming. Unlike the works cited, which study information propagation through the aggregate effect of propagating messages of varying content, ours tracked the precise paths of the fixed and invariable message of a viral marketing campaign as it spread through an email social network. The message content remained identical throughout the propagation. This allowed us to scrutinize the individuals' reactions to a particular message instead of just behavior averaged out over diverse information items. By separating all other factors impacting the participants' spreading patterns from the message content, we were able to detect the effects produced by the latter. We found that the message diffusion cascades evolve through a branching process that presents some characteristic and unique patterns unexamined until now, although some literature (Leskovec et al., 2007; Watts and Peretti, 2007) has shown an inkling of them. We noticed a steady increase in the spreaders' activity parameters as the message gets deeper in the propagation cascades. This surprising pattern cannot be observed in empirical experiments collecting propagation data of messages of varying content. It can be explained if the cascades' growth stems from a mechanism based on the affinity between the message content and the preferences of those receiving it, and not on the status of the receiving node's neighbors or on the underlying social network structure, as assumed in many of the current models. We test and validate that hypothesis through a stylized agent-based propagation model. The rest of the article is organized as follows: First, we describe the data obtained from our empirical research on real viral marketing campaigns and the control parameters of their messages' propagation. Second, we present our findings on the structure and growth patterns of the information cascades. Third, we introduce the message affinity propagation model and compare its predictions with the empirical results. The article ends with our conclusions.

2. Word-Of-Mouth diffusion research 

We tracked and measured the "word-of-mouth" diffusion of viral marketing campaigns run in eleven European markets, which offered incentives to current subscribers of an IT company's online newsletter to promote new subscriptions through recommendation emails to friends and colleagues. The campaigns were entirely web based: banner ads, emails, search engines and the company homepage drove participants to the campaign site. There, participants accessed a referral form to register themselves and enter the addresses of those to whom they recommended subscribing to the newsletter. The submission of this form triggered a personalized, but otherwise identical, recommendation message with a link to the campaign registration form. The link's customized URL was appended with codes that allowed clicks on it to be uniquely traced to the sender and addressee of the corresponding email. The form checked email addresses for syntax correctness and to prevent self-recommendations. Cookies in the participants' email client prevented sending multiple recommendations to the same address¹ and improved the user experience by pre-filling the sender's profile in subsequent sessions. Additionally, the campaign web server registered a time stamp for each of the process steps (subscription, recommendations, referral link clicks) and removed from its records referrals to undeliverable email addresses.

Market         $N$       $N_s$    $N_v$    $N_p$     Arcs      Casc.    $s_{max}$
France         11,758    3,247    524      7,987     8,593     3,248    139
DE+AT          7,943     1,760    567      5,616     6,239     1,750    146
Spain          5,260     855      505      3,900     4,454     843      122
Nordic         2,509     530      176      1,803     2,004     524      34
UK+NL          2,111     521      107      1,483     1,618     518      25
Italy          1,602     323      108      1,171     1,324     319      41
All markets    31,183    7,225    2,002    21,956    24,207    7,188    146

Table 1

Campaigns propagation data set: Count of Total Nodes ($N$), Seed Nodes ($N_s$), Viral Nodes ($N_v$), Passive Nodes ($N_p$), Total directed links (Arcs), and Total of Independent Cascades (Casc.) measured on the campaigns propagation network. $s_{max}$ is the largest cascade size by its number of nodes. Quantities in All markets may not add up to the sum of their column because network partition removes inter-country links. The number of Seed Nodes ($N_s$) may not coincide with that of cascades due to cascades merging with one another during the propagation or because, sometimes, a Seed Node cannot be identified (for example in the case of recommendation reciprocity between two nodes). Results in some countries are aggregated in homogeneous markets for statistical significance. Nordic includes DK, FI, NO and SE.

The incentive offered to recommenders was the possibility of winning laptop computers in a lottery to be held at the end of the campaign period. Aside from the obvious goal of increasing participation, the incentive's mission was twofold: firstly, to discourage indiscriminate referrals and so prevent spamming-like behavior and, secondly, to ensure legal cover for the tracking of sender-receiver data. To accomplish this, participation in the lottery was limited to the so-called successful referrals, defined as the recommendation emails whose recipients clicked on the coded URL included in them. Thus, the more referral emails sent to recipients who opened them and visited the campaign site, the higher the sender's winning odds. More importantly, both sender and receiver of any successful referral drawn in the lottery were entitled to receive the lottery prize. Terms and conditions, accessible from all web site pages and referral emails, specified that participation in the lottery implied the sender's and receiver's approval of the campaign's registration of their email transaction details, as this was necessary to ensure that both parties could receive the prize if their referral email was the winning one. Subscription to the newsletter was not required to participate in the prize draw. Campaigns ran in each country's local language but were identical otherwise: identical message, incentive, eligibility rules, lottery mechanism, campaign duration, web user interface and tracking processes. This homogeneity ensured that behavioral differences between countries were not caused by the campaigns' execution but were due to market specifics. It also validates the analysis of country-aggregated results.

¹ However, participants with cookies disabled could send multiple referrals to the same person. Thus 183 referrals (0.76% of total) were discarded.

2.1. Campaigns propagation data set 

Spurred by the campaign sponsor's web site and exogenous online advertising, a total of 7,225 individuals initiated message diffusion cascades, which grew through viral pass-along driven by 2,002 secondary spreaders. The viral offering touched another 21,956 passive nodes who did not forward it further. All in all, 31,183 individuals, of whom 9,227 were spreaders, received the viral message. Thus 77% of the individuals received the message through the endogenous viral propagation mechanism. The Cascades Network resulting from the message diffusion constitutes a directed graph with 7,188 independent cascades whose nodes represent participants linked by 24,207 directed arcs representing the recommendation emails. We call Seed Nodes ($N_s$) the individuals who spontaneously initiate recommendation cascades from the campaign site without having received a recommendation message from others, and Viral Nodes ($N_v$) those who forward a previously received message. Table 1 presents the summary data set of the campaigns' message propagation². Unsuccessful emails, disconnected nodes, nodes with invalid or undeliverable email addresses, loops and multiple referrals between the same nodes were discarded. In compliance with the sponsor's rigorous policy, all personal information was codified and masked to guarantee the participants' privacy protection.
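
The counts quoted in this subsection can be cross-checked against Table 1. The following minimal sketch transcribes the table by hand into a dictionary; the names `TABLE_1`, `nodes_add_up` and `viral_share` are illustrative assumptions and not part of the campaign tooling. It verifies that $N = N_s + N_v + N_p$ in every market and recovers the share of participants reached through viral propagation.

```python
# Table 1 transcribed by hand (an assumption of this sketch):
# market -> (N, N_s, N_v, N_p, arcs, cascades, s_max)
TABLE_1 = {
    "France":      (11758, 3247, 524, 7987, 8593, 3248, 139),
    "DE+AT":       (7943, 1760, 567, 5616, 6239, 1750, 146),
    "Spain":       (5260, 855, 505, 3900, 4454, 843, 122),
    "Nordic":      (2509, 530, 176, 1803, 2004, 524, 34),
    "UK+NL":       (2111, 521, 107, 1483, 1618, 518, 25),
    "Italy":       (1602, 323, 108, 1171, 1324, 319, 41),
    "All markets": (31183, 7225, 2002, 21956, 24207, 7188, 146),
}


def nodes_add_up(row):
    """True when total nodes equal seed + viral + passive nodes."""
    n, n_seed, n_viral, n_passive = row[:4]
    return n == n_seed + n_viral + n_passive


def viral_share(row):
    """Fraction of all nodes that are not seeds, i.e. reached by a referral."""
    n, n_seed = row[0], row[1]
    return (n - n_seed) / n


for market, row in TABLE_1.items():
    print(f"{market:12s} consistent={nodes_add_up(row)} "
          f"viral share={viral_share(row):.1%}")
```

For the All markets row the non-seed share is 23,958 / 31,183, which rounds to the 77% quoted in the text.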

2.2. Cascades Network structural metrics 

Here we examine differences and similarities between the Cascades Network topology and that of the reported email networks through which they propagate. Table 2 shows the Cascades Network structural parameters measured without considering link direction. The cumulative distribution function (c.d.f.) of the undirected network total degree $k$ is a power law, $P(k) \sim k^{-\gamma}$, whose significant probability of very connected nodes evidences higher heterogeneity than the exponential degree distributions found in some email networks (Guimera et al., 2003; Newman et al., 2002). However, their heterogeneity is less marked than that of the email network studied by Ebel et al. (2002), whose power-law degree distribution (p.d.f.), of exponent 1.81, is fatter tailed. Additionally, email networks present positive correlations between the degrees of the nodes at either end of an edge, a property called degree assortativity and measured, according to Newman (2002), by the Pearson correlation coefficient. For example, the degree correlation coefficient in the email network of Guimera et al. (2003) is $\rho = +0.188$, an indication of a correlated network. The equivalent for the Cascades Network, $\rho = -0.001$, shows a total lack of correlation. Besides, in networks with skewed node degree distributions and degree correlations, such as the email networks, the average connectivity of the network $\bar{k}$ is typically lower than that of the nearest neighbors of a node, $\bar{k}_{nn}$. For example, in the Guimera et al. (2003) email network, the ratio $\bar{k}_{nn}/\bar{k}$ is approximately 2. Such a phenomenon is responsible for the first neighbors of a node having on average more contacts than the node itself or, quoting Feld (1991), for the fact that "your friends always have more friends than you do." Interestingly, this feature is more marked in the Cascades Network, whose $\bar{k}_{nn}$ to $\bar{k}$ ratio ranges from 2.24 in UK+NL to 4.24 in Spain.

² The time dynamics of the message diffusion is covered in a separate paper.

Market         $\bar{k}$   $\sigma_k$   $\bar{k}_{nn}$   $C$       $C_{rand}$   $\ell$   $g_{max}$
France         1.46        1.594        3.99             0.0000    0.00012      2.164    8
DE+AT          1.57        2.027        5.59             0.0049    0.00020      2.671    7
Spain          1.69        2.383        7.17             0.0054    0.00032      3.287    9
Nordic         1.60        1.575        4.07             0.0077    0.00064      2.243    5
UK+NL          1.53        1.364        3.43             0.0112    0.00073      2.026    5
Italy          1.65        1.918        5.22             0.0234    0.00103      2.229    6
All markets    1.55        1.868        4.97             0.0048    0.00005      2.671    9

Table 2

Cascades Network Structural Metrics: $\bar{k}$ (total degree) is the average number of in- or out-links of a node, $\sigma_k$ its standard deviation, $\bar{k}_{nn}$ the mean of the nearest neighbors' average total degree, $C$ the Clustering coefficient, $C_{rand} = \bar{k}/N$ the corresponding value for an equivalent random network, $\ell$ the average shortest path length between reachable nodes (links considered undirected) and $g_{max}$ the maximum number of steps in directed propagation paths.
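
To illustrate why the neighbors' average degree exceeds the network average in heterogeneous networks, the sketch below computes both quantities for a toy undirected star graph and then recomputes the nearest-neighbor to average degree ratio from the Table 2 values. The function name, the toy graph and the hand-transcribed dictionary are assumptions made for this example; the nearest-neighbor average follows the definition in the Table 2 caption (mean over nodes of the neighbors' average total degree).

```python
def mean_degree_and_knn(edges):
    """Return (mean degree, mean over nodes of the neighbors' average degree)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}
    k_mean = sum(degree.values()) / len(degree)
    # For each node, average the degrees of its neighbors, then average over nodes.
    knn_per_node = [sum(degree[m] for m in nbrs) / len(nbrs)
                    for nbrs in neighbors.values()]
    return k_mean, sum(knn_per_node) / len(knn_per_node)


# Toy star: one hub linked to four leaves (hub degree 4, leaf degree 1).
star = [("hub", leaf) for leaf in "abcd"]
k_mean, k_nn = mean_degree_and_knn(star)

# (average degree, nearest-neighbor average degree) pairs copied from Table 2.
TABLE_2_K = {
    "France": (1.46, 3.99), "DE+AT": (1.57, 5.59), "Spain": (1.69, 7.17),
    "Nordic": (1.60, 4.07), "UK+NL": (1.53, 3.43), "Italy": (1.65, 5.22),
}
ratios = {market: knn / k for market, (k, knn) in TABLE_2_K.items()}
lowest = min(ratios, key=ratios.get)
highest = max(ratios, key=ratios.get)
print(lowest, round(ratios[lowest], 2), highest, round(ratios[highest], 2))
```

In the star, every leaf sees a neighbor (the hub) that is better connected than itself, which is the effect Feld describes; the Table 2 ratios recover the 2.24 (UK+NL) to 4.24 (Spain) range quoted in the text.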

Another difference between the Cascades Network and the email networks through which they propagate lies in their transitivity, a property typical of acquaintance networks whereby two individuals with a common friend are more likely than average to know each other. The Clustering coefficient $C$, defined as the fraction of all triangles found in the network relative to the total number of triads³, measures the transitivity. Table 2 shows that our Cascades Networks, with a Clustering coefficient $C = 4.8 \times 10^{-3}$ for the graph of All markets, are highly intransitive yet ten times more transitive than an equivalent random network of the same size and connectivity. In any case, this is a very low value compared to the range $C \in [0.15, 0.60]$ found in social or email networks (Newman and Park, 2003). Probabilistic considerations show the logic of such a feature: since the Cascades Network percolates its underlying email network only partially, the dyadic closure that builds clustering in the former must be just a fraction of the one in the latter. As a result, our campaigns' viral diffusion cascades, like the one in Fig. 1, are almost pure trees.

³ A triad is a group of three nodes connected by two links.
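
The Clustering coefficient described above can be sketched for a small undirected graph as follows. This is a minimal illustration, assuming the usual reading of the definition (closed triads over all triads, so each triangle is counted once per centre node); the function name and the toy edge lists are hypothetical and not taken from the campaign data.

```python
from itertools import combinations


def transitivity(edges):
    """Fraction of connected triads (two links sharing a node) that are closed."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    triads = 0   # pairs of neighbors around each centre node
    closed = 0   # triads whose two end nodes are also linked
    for nbrs in adjacency.values():
        for a, b in combinations(nbrs, 2):
            triads += 1
            if b in adjacency[a]:
                closed += 1
    return closed / triads if triads else 0.0


tree = [("seed", "a"), ("seed", "b"), ("a", "c"), ("a", "d")]   # no closed paths
triangle = [(1, 2), (2, 3), (1, 3)]
paw = triangle + [(3, 4)]                                       # triangle plus a pendant node

print(transitivity(tree), transitivity(triangle), transitivity(paw))
```

A pure tree such as the cascade in Fig. 1 has no closed triads, so the function returns zero for it, while a fully closed triangle returns one.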




Fig. 1. Tree-like Propagation Cascades: The viral messages diffusion graph 
of our campaigns consists of disconnected cascades as this one observed in 
Spain. Its 7 generations and 122 nodes stem from the node labeled Seed and 
grow through secondary propagation driven by Viral Nodes A, B and C which 
constitute 50% of generation 1. Nodes color-coded by their out-degree. The 
nodes at the end of each path are inactive (out-degree is zero) and do not 
intervene in the analysis of Section 3.1 which refers to nodes with non-zero 
in- and out-degree (the Viral Nodes). Notice the tree-like structure devoid 
of closed paths or triangles (C = 0). The average total degree of this tree is 
k = 1.984 and its largest undirected path (diameter) d = 13. 

The last distinctive property of email networks, the Small World or low average shortest path length (Boccaletti et al., 2006), although seemingly present since $\ell = 2.67$ (Table 2) is lower than that of email networks, $\ell_{email} \simeq 3.5$ (Eckmann et al., 2004; Guimera et al., 2003), is not comparable with those because the Cascades Network, split into many disconnected components, limits the path calculation to reachable pairs of nodes, which necessarily yields lower values. The distribution of the size of those cascades ($s$), like that of the total degree, is a very skewed power law whose c.d.f. exponent is 1.35. With largest cascade size $s_{max} = 146$ nodes, mean size $\bar{s} = 4.33$ and $\sigma_s = 5.27$, the cascade in Fig. 1 is 25 times more likely to appear in our campaigns than in percolation through a random network⁴.

In consequence, the viral Cascades Network topology lacks all four key features of email networks (fat-tailed node degree distribution, node degree correlations, high clustering and the Small World property) and cannot be formally characterized as a social network. This is quite logical since the viral propagation cascades of diffusion processes far from saturation, such as ours, overlay just sections of the underlying email network and, as a result, can only unveil a small portion of it. Paraphrasing Liben-Nowell and Kleinberg (2008) in their study of chain-letter propagation, it is as if "the progress of the viral messages had a type of stroboscopic effect serving to briefly light up the structure of the global email network." Unfortunately, not having any details on the topology of the email network substrate, we cannot judge the extent of its influence on the Cascades Network topology.



⁴ The tail of the cascade size distribution in large random networks near the transition to the giant component goes as $\sim s^{-5/2}$ (Albert and Barabasi, 2002) and the probability of a cascade of size 122 is $\sim 6.1 \times 10^{-6}$.
Market         $\lambda$   $\bar{r}_s$   $\bar{r}_v$   $\bar{r}_v$ SEM   $R_0$    $\bar{s}$   $\bar{s}^*$   %Dev.
France         0.062       2.21          2.50          0.1023            0.154    3.62        3.61          -0.22
DE+AT          0.092       2.48          3.06          0.1155            0.281    4.54        4.45          -2.04
Spain          0.115       3.16          3.45          0.1909            0.400    6.24        6.23          -0.20
Nordic         0.089       2.82          2.91          0.1836            0.259    4.79        4.81          +0.31
UK+NL          0.067       2.49          2.87          0.2398            0.236    4.08        4.09          +0.15
Italy          0.084       2.87          2.80          0.2301            0.236    5.02        4.76          -5.20
All markets    0.083       2.51          2.96          0.065             0.246    4.34        4.33          -0.30

Table 3

Cascades growth dynamic parameters: Transmissibility ($\lambda$), Fanout Coefficients of Seed ($\bar{r}_s$) and Viral ($\bar{r}_v$) nodes, Standard error of the Viral Nodes Fanout coefficient ($\bar{r}_v$ SEM), Basic Reproductive Number for secondary spreaders ($R_0$) and average Cascade size ($\bar{s}$) by market as measured in the campaigns. In the last two columns, $\bar{s}^*$ is the average Cascade size predicted by the Galton-Watson Branching model, Eq. (4), and %Dev. the deviation of that prediction from the actual measurements.

2.3. Cascades Network Dynamic Parameters 

While the structure of the undirected cascades is weakly related to that of the email network substrate, the flow of messages in the Cascades Network is not (except for the substrate network setting the boundary conditions) and fully depends on the recommendation mechanism. To study it we will consider the distribution of recommendation emails sent by spreaders of the viral message, which offers a better picture of the cascades' dynamics than the total node degree $k$ considered so far because 70% of the network nodes are inactive. This new variable, equivalent to the out-degree of the network nodes, is measured separately for Seed Nodes and Viral Nodes and designated as $r_s$ and $r_v$ respectively. While most Viral Nodes sent just a few recommendations, a significant fraction displayed a very intense activity: for the ensemble of All markets in our dataset, the mean of the number of recommendations sent by Viral Nodes, the so-called Fanout Coefficient, was $\bar{r}_v = 2.96$ (see Table 3), its standard deviation $\sigma_v = 7.47$ and the highest number of recommendations sent by a single individual $r_v(\max) = 72$. Its distribution can be fitted to a fat-tailed power law of the form



(1) 



whose parameters for the All markets network take the values 11.6, $\alpha = 2.83$ and $\beta = 10.96$ using Maximum Likelihood Estimation.

We can visualize the cascades of a viral propagation process growing through successive layers, or generations, as nodes reached in one generation resend the message to nodes in the next generation. The latter nodes constitute the offspring of the earlier ones in an evolution of the propagation trees whose node-level dynamics is well described by the Galton-Watson Branching model⁵ (Harris, 2002). Two parameters fully describe this growth process at the population level: the aforementioned Fanout Coefficient $\bar{r}_v$ and the message Transmissibility $\lambda$, defined as the fraction of the touched nodes that become secondary spreaders. The Transmissibility results from data in Table 1 as

$$\lambda = \frac{N_v}{N - N_s} \tag{2}$$

and both parameters combine to yield the Basic Reproductive Number $R_0$, or average number of secondary recommendations produced by reached nodes, as

$$R_0 = \lambda \bar{r}_v \tag{3}$$

This number is widely used in mathematical epidemiology (Hethcote, 2000) to determine the moment when a disease outbreak becomes a self-sustaining epidemic. Thus, if $R_0 > 1$ the spreading process reaches the Tipping-point⁶, an elusive goal that none of our campaigns attained. Table 3 presents the propagation dynamic parameters and average cascade size $\bar{s}$ of our campaigns and their predicted value $\bar{s}^*$ for the infinite propagation limit, given by the Galton-Watson Branching model as

$$\bar{s}^* = 1 + \frac{\bar{r}_s}{1 - R_0}, \qquad R_0 < 1 \tag{4}$$

where $\bar{r}_s$ is the average number of messages sent by Seed Nodes and $R_0$ the viral propagation Basic Reproductive Number. The last column in Table 3 shows the remarkable accuracy of the average cascade size predicted by the Galton-Watson Branching model versus the empirical values.

⁵ A Markovian model of a population where each individual in generation $g$ produces in generation $g+1$ a random number of individuals extracted from the same probability distribution.

⁶ Defined by analogy to phase transitions in Physics as the inflection point of the process where propagation speed accelerates drastically and becomes unstoppable, so that the message propagation reaches a very large fraction of the audience.

3. Patterns of the information cascades growth

Despite the Galton-Watson model's statistically accurate description of the distribution of cascades at a global level, a detailed study of the Cascades Network growth reveals patterns indicating that the spreading dynamics of viral messages is quite peculiar. Firstly, we present a node-level analysis showing the correlation of the spreading activity of a node with that of its active offspring down the message propagation tree. Secondly, we conduct a generation-level analysis of the probability of the nodes becoming active as a function of their ordinal position in the message diffusion path, which shows that the diffusion propensity of viral messages increases with distance from the Seed Node. Both findings lead to a striking prediction corroborated by the measurements on our viral campaigns: the diffusion dynamic parameters of viral messages at the population level are correlated, a fact that has not been observed in other social dynamics processes such as innovations adoption, rumors spreading or opinions propagation. Note that both findings are incompatible with the assumptions of the Galton-Watson model, in which the branching mechanism is homogeneous both at the social network level and within the cascades.
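
For reference, the homogeneous Galton-Watson baseline against which these patterns are contrasted can be sketched as follows. The code evaluates Eqs. (2) to (4) with the All markets figures of Tables 1 and 3 and compares the result with a small Monte Carlo simulation in which every node branches with the same rules. The two-point draw in `draw_count`, the function names, the number of simulated cascades and the random seed are assumptions chosen for illustration only; they are not the offspring distribution measured in the campaigns.

```python
import random


def transmissibility(n_total, n_seed, n_viral):
    """Eq. (2): fraction of touched (non-seed) nodes that become spreaders."""
    return n_viral / (n_total - n_seed)


def predicted_mean_size(r_seed, r_viral, lam):
    """Eq. (4): mean cascade size in the infinite-propagation limit."""
    r0 = lam * r_viral  # Eq. (3)
    if r0 >= 1:
        raise ValueError("Eq. (4) only holds below the tipping point (R0 < 1)")
    return 1 + r_seed / (1 - r0)


def draw_count(mean, rng):
    """Illustrative two-point draw (floor or floor + 1) with the requested mean."""
    base = int(mean)
    return base + (1 if rng.random() < mean - base else 0)


def simulate_cascade(r_seed, r_viral, lam, rng):
    """Size of one cascade grown with identical branching rules at every node."""
    size = 1                              # the Seed Node itself
    frontier = draw_count(r_seed, rng)    # recipients of the seed's messages
    while frontier > 0:
        size += frontier
        next_frontier = 0
        for _ in range(frontier):
            if rng.random() < lam:        # recipient becomes a Viral Node
                next_frontier += draw_count(r_viral, rng)
        frontier = next_frontier
    return size


# All markets figures: N, N_s, N_v from Table 1; fanout coefficients from Table 3.
lam = transmissibility(31183, 7225, 2002)
s_star = predicted_mean_size(2.51, 2.96, lam)

rng = random.Random(0)
sizes = [simulate_cascade(2.51, 2.96, lam, rng) for _ in range(20000)]
s_sim = sum(sizes) / len(sizes)
print(f"lambda={lam:.3f}  predicted size={s_star:.2f}  simulated size={s_sim:.2f}")
```

Because the branching is subcritical, each simulated cascade ends after a finite number of generations, and the simulated mean should approach the Eq. (4) value regardless of the shape of the offspring distribution, as long as its means are preserved. By construction this baseline cannot produce the correlations discussed in this section.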
Fig. 2. Active nodes correlated spreading: Active nearest neighbors' average out-degree $\langle r_v \rangle_{ann}$ (circles) of active nodes ($r_v > 1$) in the campaigns as a function of their parent's activity $r_v$ in a base-two logarithmic binning. The positive slope of the linear fit (0.69) shows correlation between the spreading activity of a node and that of its active offspring in the propagation tree: the more active a node is, the more active its nearest neighbors are on average.

3.1. Correlated spreading of active nodes 

The first distinctive pattern of the growth of the viral messages Cascades Network is the marked positive correlation of the spreading activity between Viral Nodes and their active offspring. In undirected networks, the correlation of the nodes' total degree is given by the conditional probability $P(k' \mid k)$ of a node of degree $k$ pointing to a node of degree $k'$. This function is very noisy in finite networks and is usually replaced by the average degree of the nearest neighbors of $k$-degree nodes, $k_{nn}(k) = \sum_{k'} k' P(k' \mid k)$ (Boccaletti et al., 2006). When $k_{nn}(k)$ is an increasing function of the degree $k$, the nodes tend to connect to others of similar connectivity and such a network, called assortative, displays positive node total-degree correlations.

However, the active nodes network is directed and one should instead study its out-degree correlation, defined as the tendency of nodes to connect with others that have out-degrees similar to their own. Its formal metric is the out-assortativity coefficient⁷ but, considering only the active nodes, a simplified analysis of the average out-degree of the active nearest neighbors $\langle r_v \rangle_{ann}$ of nodes of out-degree $r_v > 1$, presented in Fig. 2, suffices to show that, in terms of the number of recommendations sent in our campaigns, the more active a node is, the more prolific its progeny is on average. We studied the out-degree spreading pattern of active nodes in our campaigns (Seed Nodes excluded) and found that the activity of a node ($r_v$) correlates with that of its active nearest neighbors. Such correlation implies that the average number of recommendations sent by the active nearest neighbors of a node, $\langle r_v \rangle_{ann}$, grows with the number of recommendations $r_v$ that it has sent. The slope of the linear regression of $\langle r_v \rangle_{ann}(r_v)$ is +0.69, indicating strong out-degree correlation. The actual values of $\langle r_v \rangle_{ann}$ range between 1 and 31.33, the mean of their distribution is 2.48 and its standard deviation 2.08.

⁷ A convoluted combination of the probability distributions of a link going out of a node of out-degree $r_v$, of a link going into a node of out-degree $r_v'$, and the joint probability of links to go from a node of out-degree $r_v$ to another of out-degree $r_v'$ (Piraveenan et al., 2009).
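
The simplified analysis behind Fig. 2 can be sketched on a toy cascade as follows. This is only an illustration under stated assumptions: arcs are given as (sender, receiver) pairs, a node counts as active when it has sent at least one recommendation, and the function names and the example arcs are hypothetical rather than taken from the campaign data.

```python
from collections import defaultdict
from math import floor, log2


def active_neighbor_outdegree(arcs, seeds=()):
    """Map each non-seed sender with active children to
    (its out-degree, mean out-degree of its active children)."""
    children = defaultdict(list)
    for sender, receiver in arcs:
        children[sender].append(receiver)
    out_degree = {node: len(kids) for node, kids in children.items()}
    result = {}
    for node, kids in children.items():
        if node in seeds:
            continue  # Seed Nodes are excluded from the analysis
        active = [out_degree[kid] for kid in kids if out_degree.get(kid, 0) >= 1]
        if active:
            result[node] = (out_degree[node], sum(active) / len(active))
    return result


def log2_binned_means(pairs):
    """Average the second value of each pair within base-two bins of the first."""
    bins = defaultdict(list)
    for r, ann in pairs:
        bins[floor(log2(r))].append(ann)
    return {2 ** b: sum(values) / len(values) for b, values in sorted(bins.items())}


# Toy cascade: S is the seed; A forwards to three contacts, one of whom (C) forwards on.
arcs = [("S", "A"), ("S", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
        ("C", "F"), ("C", "G"), ("B", "H")]
per_node = active_neighbor_outdegree(arcs, seeds={"S"})
binned = log2_binned_means(per_node.values())
print(per_node, binned)
```

On real data, a linear fit of the binned means against the bin values would give the slope reported in the text; the toy cascade is far too small for such a fit to be meaningful.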

This very peculiar feature of viral messages diffusion has not been observed in any other type of propagation process in social networks. We can hypothesize two different explanations for it. One is that the increased spreading activity of the active children of a node is a reflection of the out-degree correlation present in the substrate email network. Lacking any data on such a network for our campaigns, this hypothesis is impossible to verify. Besides, a positive out-degree correlation in the substrate email network merely means that its nodes tend to link to others of similar out-degree, but does not by any means indicate that the number of recommendations made by active participants, hence their interest in participating in the campaign, should be a growing function of the number of recommendations made by their parent in the cascade. The other possible explanation, which we adopt, is that the intrinsic mechanism whereby participants in viral marketing campaigns forward the messages involves the sender selecting targets among those of her contacts perceived to be the most receptive to the content of the message being passed along. The iteration of these target filtering decisions through several generations of senders would lead, in a process akin to targeted search, to focusing the message on groups of individuals genuinely interested in it. Those, in turn, would also be on average more active than their ancestors. The fact that this mechanism has not been observed in other types of information diffusion, such as referral networks (Vilpponen et al., 2006), e-commerce recommendations (Leskovec et al., 2007) or email chain-letters (Liben-Nowell and Kleinberg, 2008), may indicate either that the phenomenon is specific to viral marketing messages or that those authors' analyses did not isolate the content factor.

3.2. Diffusion acceleration with path length 

The second characteristic of viral spreading dynamics appears
when measuring the probability of the nodes becoming
active spreaders as a function of their position in the
propagation tree. The Transmissibility by generation $\lambda_g$ in our
campaigns grows in correlation with the ordinal $g$ representing
the individuals' location in the message propagation path. As
shown in Table 4 for the All markets data, $\lambda_g$ increases
steadily with the generation ($\rho(g \mid \lambda_g) = 0.908$), with
parallel growth of the Reproductive Number by generation

$$R_g = \lambda_g \langle r_v \rangle_g = \frac{N_{g+1}}{N_g} \qquad (5)$$

where $N_g$ is the total number of individuals reached at
generation $g$. Besides, there is a growth trend for the Fanout by
generation, visible in our campaigns (Table 4), whose
Fanout ratio through generations
$\langle r_v \rangle_{g+1} / \langle r_v \rangle_g$ correlates
positively with the generation number ($\rho = 0.4$). Those properties
of message diffusion were detected, but not studied, by Watts
and Peretti (2007) or Leskovec et al. (2007), as shown in Fig. 3
[Table 4 body: row and column layout lost in extraction. Surviving cell
values, in extraction order: 0.0056; 2; 393; 0.0621; 6; 0.0030; 0.1127;
0.0612; 0.0007; 0.1765; 0.353; 0.1765; 0.0003; 0.1667; 4.000; 0.667; 0.0;
9; 4; 0.0002; 0.0; 0.0; 0.0; N/A.]



Table 4 

Distribution of nodes by generation: distribution of the nodes touched by
the viral message diffusion in the graph of All markets by ordinal number $g$
of the position in their diffusion path (generation). $N_g$ is the number of
nodes in generation $g$ and $P_g$ the probability of a node belonging to
generation $g \geq 1$. $(N_v)_g$ is the number of Viral Nodes by generation,
$\lambda_g$ the probability of nodes in generation $g$ becoming spreaders,
$\langle r_v \rangle_g$ the average number of recommendations sent by nodes
in generation $g$, and $R_g$ the Reproductive Number by generation, with SEM
its standard error.

along with our campaign measurements. As before, we posit
that such a pattern is due to "preferential forwarding," defined as
the spreaders' propensity to pass a message preferentially to
neighbors they presume to have more interest in, or affinity for, it.
Such a mechanism results in an increase of the recipients'
propensity to pass the message along. As a consequence, the message
follows network paths such that the Transmissibility by generation
$\lambda_g$ increases as the propagation progresses. We call
Affinity Paths the chains of individuals with similar or
increasing affinity for the message. They imply some knowledge
by message spreaders of their immediate neighbors' interests, a
local awareness with global impact that leads to a different class
of propagation than that of other Social Dynamics processes.
Its consciously driven spreading mechanism causes messages
to progress through paths presenting the homophily properties
typical of social networks (McPherson et al., 2001). This
phenomenon has been observed on the web where, according
to Singla and Richardson (2008), "there is correlation between
preferences and behavior of an individual and those of others
in its immediate circle".
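
The by-generation quantities used in this section can be computed from a cascade as sketched below. This is an illustrative example only: it assumes a tree-like cascade given as (sender, recipient) pairs with known Seed Nodes, and the function name and toy data are hypothetical.

```python
from collections import defaultdict, deque


def generation_stats(edges, seeds):
    """Per-generation N_g, lambda_g, <r_v>_g and R_g (hypothetical helper)."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    # Breadth-first search from the seeds assigns each node its generation g.
    generation = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in generation:
                generation[child] = generation[node] + 1
                queue.append(child)

    n_g = defaultdict(int)       # nodes reached at generation g
    active_g = defaultdict(int)  # nodes at generation g that forwarded
    sent_g = defaultdict(int)    # recommendations sent from generation g
    for node, g in generation.items():
        n_g[g] += 1
        r = len(children.get(node, ()))
        if r > 0:
            active_g[g] += 1
            sent_g[g] += r

    stats = {}
    for g in sorted(n_g):
        if g == 0:
            continue  # generation 0 holds the Seed Nodes
        lam = active_g[g] / n_g[g]
        fanout = sent_g[g] / active_g[g] if active_g[g] else 0.0
        stats[g] = {"N": n_g[g], "lambda": lam, "fanout": fanout, "R": lam * fanout}
    return stats


toy_edges = [
    ("s", "a"), ("s", "b"), ("s", "c"), ("s", "d"),
    ("a", "e"), ("a", "f"),
    ("e", "g"), ("e", "h"), ("f", "i"),
]
stats = generation_stats(toy_edges, ["s"])
for g, row in stats.items():
    print(g, row)
```

In this toy tree, $\lambda_g$ grows from the first to the second generation, and $R_g$ equals the ratio of consecutive generation sizes, as expected when every node is reached exactly once.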

3.3. Dynamic Parameters correlation 

As a result of the previous two properties, the parameters $\lambda$
and $\bar{r}_v$ are correlated. Let us consider the relationship between
the Fanout Coefficient and the generation parameters in Table 4

$$\bar{r}_v = \frac{\sum_{g=2} N_g}{\sum_{g=1} \lambda_g N_g} = \frac{1 - P_g(1)}{\sum_{g=1} \lambda_g P_g} \qquad (6)$$

where $P_g(1) = N_1 / \sum_{g=1} N_g = N_s \bar{r}_s / (N - N_s)$ is the
probability of an individual having received the message from a Seed
Fig. 3. Diffusion acceleration with path length: Reproductive Number by
generation $R_g$ in viral message propagation. Solid circles with error bars
correspond to our IT newsletter campaign. Other data sets (no error bars
available): Oxygen Network advocacy portal collecting contributions for
hurricane Katrina relief (squares); Tide Coldwater campaign for an
energy-efficient washing detergent (empty circles); StopTheNRA, an appeal
for gun control launched by the father of a Columbine shootings victim
(upward triangles), per Watts and Peretti (2007); referrals in e-commerce
(downward triangles), per Leskovec et al. (2007).

Node. Since $\sum_{g=1} \lambda_g P_g = \lambda$, one obtains the important
expression $\lambda \bar{r}_v = 1 - P_g(1)$, which means that for $\lambda$
and $\bar{r}_v$ to increase simultaneously one must reduce the probability
$P_g(1)$ of finding nodes in the first generation or, equivalently, grow
longer cascades. Thus, a growing $\lambda_g$ yields longer paths and causes
a parallel growth of $\bar{r}_v$. Our campaigns show that the average
shortest path length $\langle \ell \rangle$ of the diffusion cascades and
the dynamic parameters are strongly correlated:
$\rho(\ell \mid \bar{r}_v) = 0.88$ and $\rho(\ell \mid \lambda) = 0.89$.
An increase of the Transmissibility $\lambda$ increases both the path
length and the average number of recommendations made. Plotting the
dynamic parameters for various markets (Fig. 4), their correlation was
found to be very strong, with a Pearson coefficient
$\rho(\lambda \mid \bar{r}_v) = 0.92$. The values of $\lambda$ and
$\bar{r}_v$ by country from Table 3 fit the decreasing exponential
$$\bar{r}_v = 1 + b\,(1 - e^{-c\lambda}) \qquad (7)$$

which for $\lambda \ll 1$, and through a Maclaurin series expansion
of $e^{-c\lambda}$, turns into $\bar{r}_v = 1 + a\lambda$ ($a = bc$).
One can consider the slope $a$ of this "response line" as the message
"fitness" with respect to each market. The exponential decrease for
large $\lambda$ in Eq. (7) is due to the clustering of the substrate
network nodes, which limits propagation through saturation and finite
size effects.
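
For reference, the two limits of Eq. (7) can be worked out explicitly. The expansion below uses only the exponential form quoted in the caption of Fig. 4 and the fitted values given there; the numerical products are simple arithmetic on those fitted values rather than reported results.

```latex
\begin{align*}
\bar{r}_v &= 1 + b\left(1 - e^{-c\lambda}\right)
           = 1 + b\left(c\lambda - \tfrac{(c\lambda)^2}{2} + \dots\right)
           \approx 1 + bc\,\lambda
           && (c\lambda \ll 1), \\
\bar{r}_v &\to 1 + b
           && (c\lambda \gg 1).
\end{align*}
% With the Fig. 4 values b = 3.82 and c = 8.44:
%   initial slope     bc    = 3.82 * 8.44 ~ 32.2
%   saturation level  1 + b = 4.82
```

Note that the initial slope of the exponential fit, $bc \approx 32.2$, need not coincide with the slope $a = 22.48$ of the separate linear fit in Fig. 4, since the two curves are fitted independently to the same points.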

In principle, this correlation between Fanout Coefficient and
Transmissibility should invalidate the Galton-Watson model
used in Section 2.3, because that model assumes that those
parameters are uncorrelated. However, this is not the case, since
most of the participants in the campaign appear at very low
generation numbers and thus the phenomenon observed here is
a correction that significantly affects only a small fraction of
participants.



Homophily: the tendency of individuals to associate and bond with similar others.



The Y intercept is set to 1 since $\bar{r}_v \to 1$ as $\lambda \to 0$, because the fit is on active nodes.
Fig. 4. Dynamic Parameters Correlation: Correlation between the
population-level dynamic parameters Transmissibility ($\lambda$) and Fanout
Coefficient ($\bar{r}_v$) of our viral campaigns in different markets.
Dotted line is the linear fit to $\bar{r}_v = 1 + a\lambda$ with $a = 22.48$
and $R^2 = 0.843$. Solid line is the exponential fit to
$\bar{r}_v = 1 + b(1 - e^{-c\lambda})$ with $b = 3.82$, $c = 8.44$ and
$R^2 = 0.818$. A market's position towards the rightmost side of this
"response line" indicates a higher affinity of the audience with the
campaign message. Number of spreading (active) nodes shown under market name.

4. The Message Affinity Model (MAM) 

The correlation between the message propagation dynamic
parameters $\lambda$ and $\bar{r}_v$, and the independence of the nodes'
spreading activity from the substrate email network they run upon, are
intriguing properties of viral marketing diffusion processes.
Watts and Dodds (2007) built a model showing that information
propagation can happen independently of the underlying
social network structure and concluded that "large cascades of
influence are driven not by the influential but by a critical
mass of easily influenced individuals." However, their model
explains neither the dynamic parameters correlation nor the
increase with the generation of the nodes' propensity to become
spreaders. We posit that both features are due to the fact that
the decisions of forwarding a viral message and of the number
of neighbors to send it to, typically made in a single act by each
forwarding individual, are correlated, and that such correlation
emerges as a function only of their affinity with the content of
the message being spread.

The agent-based Message Affinity Model (MAM) incorporates
that mechanism by assigning to the substrate network
nodes a propensity value representing their affinity with the
message being forwarded. Furthermore, the model propagation
rules combine a variant of the state transition steps of the SIR
epidemic model on networks (Pastor-Satorras and Vespignani,
2001) with the stochastic evolution of a pseudo-Markovian*
Galton-Watson Branching model. At any step, the network
nodes are in one of the following three states:

- Susceptible (S): Node has not received the message

- Informed (I): Node is propagating the message

- Refractory (R): Node does not spread the message anymore

* The Galton-Watson Branching model used in Section 2.3 explains well the
growth of the cascades at the average level but fails to predict the activity
correlations that appear in the evolution through generations. This is because
the Galton-Watson model stochastic process is Markovian while, in reality,
one node's activity depends on that of its parent.

Unlike the SIR model, MAM does not use a global probability
for the node state transitions. Instead, they stem from
the aggregate decisions that result from the interplay between
the nodes' pass-along propensity and the message "fitness" to
diffuse. Drawn from a continuous probability density function
$p(a)$, the Affinity $a_n \in [0, 1]$ of a node represents its
propensity to engage in spreading the message. The message fitness
to trigger node activations is represented by its Affinity
Threshold $A_T \in [0, 1]$, the lowest $a_n$ value for which such a
message can push the node into the Informed state: low-threshold
messages are capable of activating more nodes and are, as a
result, forwarded more often than high-threshold ones. The
process starts by turning a random fraction of the substrate network
nodes into the Informed state while leaving all others
Susceptible. From that point onwards the following rules govern the
stochastic propagation:

(i) Susceptible nodes touched by the message become Informed
if their Affinity is higher than the message threshold
($a_n > A_T$) and Refractory otherwise while, if touched,
Informed or Refractory nodes stay unchanged.

(ii) An Informed node $n$ forwards a number of messages
$(r_v)_n = (a_n - A_T) \times r$, with $r$ drawn from a PL
distribution. The neighbors receiving those messages are

(a) those with highest $a_n$ with probability $(a_n - A_T)$

(b) chosen randomly with probability $1 - (a_n - A_T)$

(iii) Informed nodes become Refractory immediately after
spreading the message and the process ends when no Informed
nodes are left
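
Rules (i)-(iii) can be sketched in a few lines of code. What follows is a minimal illustration, not the implementation used for the simulations reported here. Several choices are assumptions made only for the example: the function name `run_mam`, the power-law exponent `alpha` and cap `r_cap`, rounding the number of messages up to an integer, clamping $a_n - A_T$ to $[0, 1]$ (so a seed below threshold sends nothing), and drawing the choice between (a) and (b) once per sender rather than once per message.

```python
import math
import random


def run_mam(neighbors, affinity, threshold, seeds, rng, alpha=2.5, r_cap=50):
    """Run one cascade; return (final state per node, messages sent per sender)."""
    state = {node: "S" for node in neighbors}
    sent = {}
    frontier = list(seeds)
    for s in frontier:
        state[s] = "I"  # seeds start Informed regardless of their Affinity

    while frontier:
        node = frontier.pop(0)
        gap = min(1.0, max(0.0, affinity[node] - threshold))
        # Rule (ii): r from a power law with density exponent alpha (assumed).
        r = min(r_cap, rng.paretovariate(alpha - 1.0))
        k = min(len(neighbors[node]), math.ceil(gap * r))
        if k > 0:
            if rng.random() < gap:
                # (a) target the neighbors with the highest Affinity
                ranked = sorted(neighbors[node], key=lambda m: affinity[m], reverse=True)
                targets = ranked[:k]
            else:
                # (b) target neighbors chosen at random
                targets = rng.sample(list(neighbors[node]), k)
            sent[node] = k
            for t in targets:
                if state[t] == "S":  # Rule (i): only Susceptible nodes change
                    if affinity[t] > threshold:
                        state[t] = "I"
                        frontier.append(t)
                    else:
                        state[t] = "R"
        state[node] = "R"  # Rule (iii): Refractory right after spreading
    return state, sent


# Toy substrate: 30 nodes on a line, each linked to nodes within distance 3.
rng = random.Random(0)
nodes = list(range(30))
neighbors = {n: [m for m in nodes if m != n and abs(m - n) <= 3] for n in nodes}
affinity = {n: rng.random() for n in nodes}  # uniform Affinity
state, sent = run_mam(neighbors, affinity, 0.3, [0, 15], rng)
print(sum(1 for s in state.values() if s != "S"), "nodes touched;", len(sent), "spreaders")
```

Because the number of messages and the quality of the targeting both scale with $a_n - A_T$, high-affinity senders are at once more prolific and better at finding receptive neighbors, which is the coupling the model relies on.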

The quantity $a_n - A_T$ embodies the interplay between the
participants' interests and the message content. The choice in
Rule (ii) of the neighbors that will receive the message
represents the evaluation Informed nodes make, based on their
local knowledge, of their neighbors' affinity. It implies that local
knowledge grows with the Affinity: nodes of high $a_n$ are more
likely to choose targets with the highest propensity to pass the
message, while those with low $a_n$ will mostly choose their
targets randomly. $A_T$ may vary by individual but, without loss of
generality, we take it constant, including all variations in $p(a)$.

4.1. MAM Simulation Results

Here we present the results of Monte Carlo simulations of
viral message propagation run with the MAM model and show
that they replicate the patterns observed in real processes. The
simulations ran on two substrate networks with the same
degree distribution but different structure: the real email network
of a Spanish university (Guimera et al., 2003) and a synthetic
configuration model network built with the Molloy and Reed
method (Callaway et al., 2001). They differ in their Clustering
Coefficient ($C_{email} = 0.22$ vs. $C_{conf} = 0.014$) and in the fact
that the email network node degrees are correlated while the
Fig. 5. Cascades Network Average Cascade Size. Expected average size
of viral cascades for different values of the Reproductive Number $R_0$ in
simulations on the email (boxes) and Config. (circles) networks with uniform
Affinity distribution of mean $\bar{a}_n = 0.13$ and $\bar{a}_n = 0.17$
respectively. The Affinity Threshold $A_T$ used in the simulations ranges
from 0.6 to 0.97 to show the asymptotic growth for $R_0 \sim 1$. Solid lines
are not a fit but the predictions of Eq. (4). Inset: Curves collapse when
plotting $(\bar{s} - 1)/\bar{r}_s$ against $R_0$, showing the independence
of the viral propagation patterns from the substrate network topology.



configuration network ones are not. Their nodes' Affinity, with
correlation between nearest neighbors, was drawn from a uniform
distribution. The Cascades Networks resulting from the
propagation of messages with Affinity Threshold between 0.6
and 0.97 were averaged over 15K cascades with 500 different
allocations of the substrate nodes' Affinity.

The simulations generate graphs with a large number of
disconnected components that, like those in the real campaigns,
feature distributions of Eq. (1) type for both their viral nodes'
activity $P(r_v)$ and cascade size $P(s)$. The exponents
of their power laws are in the range 1-3, depending on the
values of the model parameters used: nodes' Affinity ($a_n$) and
message Affinity Threshold ($A_T$). Besides, the average cluster size
of the graphs obtained in the simulations closely follows the
branching model predictions, as shown in Fig. 5. It plots the
average size $\bar{s}$ of the propagation network components obtained
with different values of the message Affinity Threshold versus
the reproductive number $R_0$ for each. The lines are not a fit to
the data but the prediction $\bar{s}$ given by Eq. (4). Notice their
remarkable agreement and the fact, shown in the inset, that when
the effect of Seed Nodes is removed by plotting
$(\bar{s} - 1)/\bar{r}_s$, the results for the simulations on both
substrate networks match exactly. This indicates that, as our model
predicts, for processes running well below the Tipping-point the
impact of the substrate network on the cascades' average size or the
dynamic parameters of the propagation is very low.

The plot of the Cascades Network dynamic parameters in the
main panel of Fig. 6 and their fit to Eq. (7) show how MAM
accurately replicates their correlation pattern. This shows that
the viral message propagation patterns are independent of the
substrate network structure for low $\lambda$. However, their
$\bar{r}_v$ values diverge as $\lambda$ grows because the email network
clustering and degree correlations accelerate saturation effects and
curtail propagation. The diffusion acceleration with path length
presented in Fig. 3 and typical of viral message propagation is
also properly replicated with MAM. The inset of Fig. 6 presents
the evolution of $\lambda_g$ with $g$ for simulations on the real email
network (dotted lines) alongside that of our empirical results. The
striking similarity of both up until $g = 5$ is quite significant.
The low number of active nodes left in the substrate network
beyond that point renders the statistics of the results unreliable.
The same pattern (not shown) appears for simulations on the
configuration model network. The growth of $\lambda_g$ cannot
prevent the propagation process from ending. In fact, for $R_g < 1$,
$\lambda_g$ is a probability below unity applied at each subsequent
generation to an ever-shrinking cohort of nodes. As proved by
Branching Process theory, the cascades inevitably reach a point where
there is no new offspring and they die off. Actually, even for
$R_g > 1$ the cascades' extinction has a non-zero probability that
increases with the heterogeneity of the participants' activity
distribution (Harris, 2002).

Fig. 6. Correlations in MAM simulations. Main panel shows the correlation
between dynamic parameters for simulations on a real email network substrate
(boxes) and on an equivalent configuration model network (circles), both
with uniform distribution of the nodes' Affinity of mean $\bar{a}_n = 0.28$.
Numbers indicate the message Affinity Threshold ($A_T$) for each simulation.
Fits are to Eq. (7) with parameters $b_{email} = 2.36$, $c_{email} = 5.06$
and $R^2_{email} = 0.993$, and $b_{conf} = 2.44$, $c_{conf} = 5.26$ and
$R^2_{conf} = 0.976$ for the respective network substrate. Inset: Evolution
of the Transmissibility by generation ($\lambda_g$) for three of the
simulations ($A_T$ = 0.69, 0.75, 0.81) run on the email network (empty
symbols) compared with that of the real campaigns (full circles).
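
The extinction argument can be illustrated with a small Monte Carlo sketch of a branching process. The assumptions here are ours and deliberately crude: every node, including the first, becomes a spreader with the same probability `lam` and then sends a fixed number `fanout` of messages, so the mean offspring number is `lam * fanout`. The function name and parameter values are hypothetical and unrelated to the campaign data.

```python
import random


def simulate_cascade(lam, fanout, rng, max_size=2000):
    """Grow one cascade; return (size, went_extinct)."""
    size, pending = 1, 1  # pending = nodes that have not yet decided
    while pending > 0 and size < max_size:
        pending -= 1
        if rng.random() < lam:  # node becomes a spreader
            size += fanout
            pending += fanout
    return size, pending == 0


rng = random.Random(42)

# Subcritical case: mean offspring 0.1 * 3 = 0.3 < 1.
sub = [simulate_cascade(0.1, 3, rng) for _ in range(20000)]
mean_size = sum(size for size, _ in sub) / len(sub)
print("subcritical mean size:", round(mean_size, 3), "theory:", round(1 / (1 - 0.3), 3))

# Supercritical case: mean offspring 0.5 * 3 = 1.5 > 1.
sup = [simulate_cascade(0.5, 3, rng) for _ in range(2000)]
extinct_fraction = sum(done for _, done in sup) / len(sup)
print("supercritical extinct fraction:", round(extinct_fraction, 3))
```

Below the threshold every cascade dies out and the mean size approaches $1/(1 - 0.3)$. Above it, the extinction probability $q$ of this toy process solves $q = 0.5 + 0.5\,q^3$, that is $q = (\sqrt{5} - 1)/2 \approx 0.62$, so a majority of cascades still die off even though the process is supercritical.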

5. Conclusions and Discussion 

We tracked and analyzed the structure and growth dynamics
of the propagation network created by the diffusion of a
content-controlled message in real viral marketing campaigns
driven through email forwarding. The resulting Cascades
Network, formed by almost pure trees of very low clustering, shows
two striking dynamical patterns not observed so far in other
Social Dynamics processes like rumor spreading, innovation
adoption or email chain-letters. First, there is positive
correlation between the spreading nodes' activity level, as measured by
their out-degree, and that of their active offspring and, second,
the propensity of nodes reached by the message to become
spreaders, the Transmissibility $\lambda$, grows with those nodes'
depth in the propagation path. These novel properties can only be
detected by scrutinizing the propagation of messages of fixed
and identical content. The scarcity of such data may
explain why they have remained unobserved until now. The
discovered patterns have two remarkable consequences. On the
one hand, the dynamic parameters Transmissibility and Fanout
Coefficient for a given message across different markets are
correlated. On the other, the topology of the email network
underlying the propagation has limited influence on the Cascades
Network, although its features are compatible with the structure
of the substrate email network that conditions their formation.

Our explanation of all those peculiarities stems from the
mechanism driving the message propagation, which involves
the affinity of the campaign participants with the content of
the message. Participants would make a simultaneous and
conscious decision on whether to spread it and to whom, which leads
to the positive correlation between the probability of
becoming a spreader after receiving the message and the average
number of messages forwarded. This decision would result from a
single intrinsic property of the nodes in the substrate network,
their affinity with the message being passed along. Besides, the
dynamic parameters by generation $\lambda_g$ and
$\langle r_v \rangle_g$ tend to grow with $g$ since the choice of
targets to forward the message to is based on the participants'
awareness of their neighbors' affinity with it. Such a mechanism
steers the message through paths of increased affinity termed
Affinity Paths.

This hypothesis is tested through an agent-based model
(MAM) that replicates the patterns discovered and validates
the proposed Affinity-driven information diffusion mechanism.
It combines a stochastic branching process with propagation
rules that create cascades of touched nodes by taking the
substrate network nodes' message awareness through a sequence
of Susceptible, Informed and Refractory states. The MAM uses
just two control parameters: the Affinity distribution $p(a)$ of the
substrate network nodes, used to assign them an affinity value
between 0 (message is not sent) and 1 (message will certainly be
forwarded), and the Affinity Threshold $A_T$ representing the
message fitness to be passed around. As the model runs through
a substrate network list of edges, the interplay between $A_T$
and the nodes' Affinity generates cascades with all the expected
features while providing a glimpse into the substrate network
topology. The empirical analysis and the theoretical model
validate our conclusion that the mechanism driving viral marketing
message propagation results from the affinity between the
campaign participants' preferences and the message content.
In fact, the viral cascades' features depend more on the
individuals' reaction to the message than on the substrate network
topology. However, we could not verify this conclusion empirically:
since the structure of our campaigns' substrate network is unknown,
a comparison between the Cascades Network and the substrate email
network was impossible. Also, MAM does not replicate the merging of
cascades that occurs near the Tipping-point, as it assumes that Seed
Nodes are planted in a boundless network and far apart from each
other to avoid propagation clashing. Finally, MAM only runs on
undirected and fully connected networks.
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